refactor: adopt mixin chain + emit per-phase spans #49

Merged
timzsu merged 6 commits into main from zsu/rfc48-omni-mixins on May 15, 2026
Collaborator

@timzsu timzsu commented May 14, 2026

Purpose

The four omni executors predated the executor mixin chain (InferenceMixin, DataMixin, GovernanceMixin) and were the last family still on a plain-Executor base. This PR moves them onto the chain so they emit OTel traces like the inference and training executors, adds three per-phase spans inside each run(), routes prompts through the mixin's data path, and uploads spans to the server's /traces endpoint so HTTP-destination workers don't strand their trace JSONL on the worker filesystem. This is the omni item of RFC #48.

Changes

  • omni_executor_base.py: OmniExecutorBase inherits (InferenceMixin, Executor). The executor-local collect_text_inputs helper is removed; prompts now come from DataMixin._collect_prompts_for_spec, with each call site narrowing PromptInput → str inline and raising ExecutionError if any item is not a str (omni executors don't consume the chat-message form).
  • omni_text2{image,speech,audio,general}_executor.py: each run() wraps self._run_inner(...) in self._task_span(...), then calls maybe_upload_artifacts(...) and maybe_upload_traces(...) after the span exits — matches the vllm_executor / diffusers_executor shape. Doing the uploads after __exit__ is required so the root task span row is flushed into spans.jsonl before the trace upload reads it. Without maybe_upload_traces, omni span files stayed on the worker filesystem in HTTP-destination deployments and never reached the server's /api/v1/traces/workflows/{wfl}/spans endpoint.
  • Three SpanType.COMPUTE sub-spans inside the task span: model load (_ensure_omni), generation (the omni.generate loop / streaming generator), output postprocessing (artifact save loop). Attributes carry prompt_count, item_count, flowmesh.type=compute.
  • examples/templates/omni_text2{speech,audio,general}.yaml: migrated from data.text: "..." to the canonical data.type: list / items: [...] mixin shape.
  • tests/worker/test_omni_executor_inheritance.py: parametrized over the four executor classes + the base; asserts each is a subclass of InferenceMixin, DataMixin, GovernanceMixin. Catches future regressions of the base class hierarchy.

Test Plan

  • uv run pytest tests/worker/test_omni_executor_inheritance.py tests/server tests/shared tests/sdk tests/cli — 537 passed, mypy clean across the touched files.
  • Live e2e on one GPU worker, all four omni templates. For each: ok=True, expected artifact on disk (generated_tts.wav / generated_image_*.png / bgm.wav / narration.wav), spans include task, model load, generation, output postprocessing, prompts threaded through DataMixin._collect_prompts_for_spec.

Test Result

537 unit tests passed; mypy clean.

Live e2e (1 GPU worker, omni-mixins images):

| Workflow | Total (s) | Model load (s) | Generation (s) | Output postprocessing (s) | Other (s) |
|---|---|---|---|---|---|
| omni_text2speech | 147.03 | 119.72 (81.4 %) | 27.30 (18.6 %) | 0.00 (0.0 %) | 0.01 |
| omni_text2image | 105.33 | 90.74 (86.1 %) | 13.92 (13.2 %) | 0.65 (0.6 %) | 0.02 |
| omni_text2audio | 31.52 | 29.59 (93.9 %) | 1.92 (6.1 %) | 0.00 (0.0 %) | 0.01 |
| omni_text2general | 451.74 | 450.79 (99.8 %) | 0.92 (0.2 %) | 0.03 (0.0 %) | 0.01 |

Cold weights dominate; generation second; postprocessing negligible. Sub-spans sum to within 20 ms of the root in every case. Image, audio, general re-validated on the post-mixin-migration commit; results match the table above. Trace-upload fix is behavior-only for HTTP-destination workers and doesn't change wall-clock; local-stack mode reads spans off the shared docker volume regardless.
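
The percentages and the "within 20 ms" claim can be rechecked from the reported durations. The sketch below is only arithmetic on the numbers in the results table; `RUNS` and `breakdown` are names invented here, and because the inputs are already rounded to 10 ms the recomputed residual can differ from the reported "other" column by 0.01 s.

```python
# (total, model load, generation, output postprocessing) in seconds,
# copied from the results table.
RUNS = {
    "omni_text2speech": (147.03, 119.72, 27.30, 0.00),
    "omni_text2image": (105.33, 90.74, 13.92, 0.65),
    "omni_text2audio": (31.52, 29.59, 1.92, 0.00),
    "omni_text2general": (451.74, 450.79, 0.92, 0.03),
}


def breakdown(total: float, load: float, generation: float, postprocessing: float) -> dict:
    """Share of the root span per phase, plus time not covered by any sub-span."""

    def pct(seconds: float) -> float:
        return round(100.0 * seconds / total, 1)

    return {
        "model_load_pct": pct(load),
        "generation_pct": pct(generation),
        "postprocessing_pct": pct(postprocessing),
        "other_s": round(total - (load + generation + postprocessing), 2),
    }


if __name__ == "__main__":
    for workflow, durations in RUNS.items():
        print(workflow, breakdown(*durations))
```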


Pre-submission Checklist
  • I have read the contribution guidelines.
  • I have run pre-commit run --all-files and fixed any issues.
  • I have added or updated tests covering my changes (if applicable).
  • I have verified that uv run pytest tests/ passes locally.
  • If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker.
  • If I changed the SDK or CLI, I have verified the affected packages work (uv sync --all-packages --group ci --frozen).
  • If this is a breaking change, I have prefixed the PR title with [BREAKING] and described migration steps above.
  • I have updated documentation or config examples if user-facing behavior changed.

- OmniExecutorBase inherits (InferenceMixin, Executor) so the four
  omni executors pick up GovernanceMixin / DataMixin / InferenceMixin
  from the same chain the inference and training executors use.
- Each concrete omni executor wraps run() with self._task_span(...)
  so a 'task' root span is emitted with executor.name + workflow_id.
- Inside run(), three per-phase compute spans are added — 'model
  load' (_ensure_omni), 'generation' (the omni.generate call(s)),
  and 'output postprocessing' (artifact save + items build) — mirroring
  the vllm executor's tracing shape.
- New tests/worker/test_omni_executor_inheritance.py asserts the
  full mixin chain on each omni executor class as a test-time
  guard against regression.
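
A rough sketch of what such an inheritance guard checks. The classes below are empty stand-ins defined only so the snippet runs on its own, and the assumption that InferenceMixin derives from DataMixin, which derives from GovernanceMixin, is inferred from the description rather than confirmed. The real test is parametrized with pytest over the actual executor classes.

```python
# Stand-in hierarchy (assumed shape; the real classes live in the worker package).
class Executor: ...
class GovernanceMixin: ...
class DataMixin(GovernanceMixin): ...
class InferenceMixin(DataMixin): ...


class OmniExecutorBase(InferenceMixin, Executor): ...
class OmniText2ImageExecutor(OmniExecutorBase): ...
class OmniText2SpeechExecutor(OmniExecutorBase): ...


REQUIRED_MIXINS = (InferenceMixin, DataMixin, GovernanceMixin)
EXECUTOR_CLASSES = (OmniExecutorBase, OmniText2ImageExecutor, OmniText2SpeechExecutor)


def missing_mixins(cls: type) -> list[str]:
    """Names of required mixins that cls does not inherit from."""
    return [m.__name__ for m in REQUIRED_MIXINS if not issubclass(cls, m)]


def check_all() -> dict[str, list[str]]:
    """Map each executor class name to its missing mixins (empty means OK)."""
    return {cls.__name__: missing_mixins(cls) for cls in EXECUTOR_CLASSES}
```

If a later refactor put an executor back on a plain `Executor` base, `missing_mixins` would return all three names for that class and the test would fail.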

Live e2e on a single GPU worker against all four omni templates
(omni_text2{speech,image,audio,general}.yaml): each task reports
ok=True with the expected artifact, and spans.jsonl contains the
'task' root plus 'model load' / 'generation' / 'output postprocessing'
sub-spans.

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu changed the title refactor(omni): adopt mixin chain + emit per-phase spans (RFC #48) refactor: adopt mixin chain + emit per-phase spans (RFC #48) May 14, 2026
timzsu added 2 commits May 14, 2026 10:08
…spec

Replace the executor-local collect_text_inputs helper with the mixin's
_collect_prompts_for_spec. Each omni executor now narrows PromptInput to
str inline and raises ExecutionError if any item is not a string. Templates
adopt the canonical data.type: list / items shape.
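
The narrowing step can be pictured roughly as follows. This is a sketch under assumptions: `ExecutionError` is redefined locally as a stand-in, the `PromptInput` alias is a guess at "plain string or chat-message list", and `narrow_to_text` is a hypothetical helper name (the PR does the check inline at each call site).

```python
from typing import Union

# Assumed shape: either a plain string or a list of chat messages.
PromptInput = Union[str, list[dict[str, str]]]


class ExecutionError(RuntimeError):
    """Stand-in for the worker's real ExecutionError."""


def narrow_to_text(raw_prompts: list[PromptInput]) -> list[str]:
    """Return the prompts as plain strings, rejecting chat-message items."""
    prompts: list[str] = []
    for index, item in enumerate(raw_prompts):
        if not isinstance(item, str):
            raise ExecutionError(
                f"prompt {index} is {type(item).__name__}, expected str; "
                "omni executors do not consume the chat-message form"
            )
        prompts.append(item)
    return prompts
```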

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
Move maybe_upload_artifacts out of the task span and add a missing
maybe_upload_traces call right after, matching the vllm and diffusers
pattern. Without the trace upload, omni span JSONL stayed on remote
workers and never reached the server's /traces endpoint in HTTP mode.

Also extract _run_inner in the image and speech executors so the
post-span fall-through is the same shape across all four.
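
Why the upload has to sit after the span exit can be shown with a toy example. Everything here is a stand-in: `task_span` writes its row to a JSONL file only when the block exits, and `upload_traces` just reads whatever rows exist at call time; neither is the real `_task_span` or `maybe_upload_traces`.

```python
import json
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def task_span(spans_path: Path, name: str = "task"):
    """Append the span row to spans.jsonl on exit (toy model of a flush)."""
    try:
        yield
    finally:
        with spans_path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps({"name": name}) + "\n")


def upload_traces(spans_path: Path) -> list[str]:
    """Pretend upload: return the span names visible in the file right now."""
    if not spans_path.exists():
        return []
    lines = spans_path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line)["name"] for line in lines]


def run_upload_inside(spans_path: Path) -> list[str]:
    with task_span(spans_path):
        return upload_traces(spans_path)  # root row not written yet


def run_upload_after(spans_path: Path) -> list[str]:
    with task_span(spans_path):
        pass  # the work would happen here
    return upload_traces(spans_path)  # root row is on disk now
```

In this toy model, uploading inside the span sees an empty file, while uploading after it sees the `task` row.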

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu marked this pull request as ready for review May 14, 2026 10:45
@timzsu timzsu requested a review from kaiitunnz as a code owner May 14, 2026 10:45
@timzsu timzsu mentioned this pull request May 14, 2026
9 tasks
Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu changed the title refactor: adopt mixin chain + emit per-phase spans (RFC #48) refactor: adopt mixin chain + emit per-phase spans May 14, 2026
@timzsu timzsu force-pushed the zsu/rfc48-omni-mixins branch from 47dd136 to 188e305 Compare May 14, 2026 16:25
@timzsu timzsu requested a review from J1shen May 15, 2026 05:05
Collaborator

@kaiitunnz kaiitunnz left a comment

Minor comments. Additional consideration:

Another cleanup to consider: move the run method from the Omni executor classes to OmniExecutorBase, define _run_inner as an abstract method whose spec parameter is of type TaskSpecStrictBase, and define a class attribute _TASK_SPEC_TYPE. That way, OmniExecutorBase can call self.require_spec(task, self._TASK_SPEC_TYPE) inside the generic self.run.

A drawback of this approach is that you need to call assert isinstance(spec, <spec-type>) as the first line of every concrete _run_inner.

Comment thread src/worker/executors/omni_text2audio_executor.py Outdated
Comment thread src/worker/executors/omni_text2audio_executor.py
Comment thread src/worker/executors/omni_text2general_executor.py
Comment thread src/worker/executors/omni_text2general_executor.py Outdated
Comment thread src/worker/executors/omni_text2image_executor.py Outdated
Comment thread src/worker/executors/omni_text2image_executor.py
Comment thread src/worker/executors/omni_text2speech_executor.py Outdated
Comment thread src/worker/executors/omni_text2speech_executor.py
Each concrete omni executor's run() did the same five things: resolve
spec, dump dict, normalize out_dir, run the task span, upload artifacts
and traces. Move that boilerplate to OmniExecutorBase.run() and let
subclasses contribute via a _TASK_SPEC_TYPE class attribute plus an
abstract _run_inner whose first line is `assert isinstance(spec, ...)`
to recover the concrete type.

Also adopt the cast(list[str], raw_prompts) form for the prompt-string
narrowing in all four executors so the pattern reads identically.
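
The resulting shape might look roughly like the sketch below. It is not the merged code: `TaskSpecStrictBase` and `require_spec` are redefined as stand-ins, `Text2ImageSpec` and its `prompt` field are hypothetical, and the span and upload steps of the real `run()` are reduced to a comment.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass
class TaskSpecStrictBase:
    """Stand-in for the real spec base class."""


@dataclass
class Text2ImageSpec(TaskSpecStrictBase):
    """Hypothetical concrete spec; real field names may differ."""

    prompt: str = ""


class OmniExecutorBase(ABC):
    # Each subclass names the spec type it expects.
    _TASK_SPEC_TYPE: ClassVar[type[TaskSpecStrictBase]]

    def require_spec(
        self, task: dict[str, Any], spec_type: type[TaskSpecStrictBase]
    ) -> TaskSpecStrictBase:
        # Stand-in: the real helper's behavior is not shown in this PR.
        spec = task["spec"]
        if not isinstance(spec, spec_type):
            raise TypeError(f"expected {spec_type.__name__}, got {type(spec).__name__}")
        return spec

    def run(self, task: dict[str, Any]) -> dict[str, Any]:
        spec = self.require_spec(task, self._TASK_SPEC_TYPE)
        # The real run() also dumps the spec, normalizes out_dir, opens the
        # task span, and uploads artifacts and traces after the span exits.
        return self._run_inner(spec)

    @abstractmethod
    def _run_inner(self, spec: TaskSpecStrictBase) -> dict[str, Any]:
        ...


class OmniText2ImageExecutor(OmniExecutorBase):
    _TASK_SPEC_TYPE = Text2ImageSpec

    def _run_inner(self, spec: TaskSpecStrictBase) -> dict[str, Any]:
        # Recover the concrete type, as the review suggested.
        assert isinstance(spec, Text2ImageSpec)
        return {"ok": True, "prompt": spec.prompt}
```

The `assert isinstance(...)` line is the cost the reviewer pointed out: the abstract signature only promises the base spec type, so each subclass narrows it again before touching spec-specific fields.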

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>
@timzsu timzsu requested a review from kaiitunnz May 15, 2026 08:54
Signed-off-by: Noppanat Wadlom <noppanat.wad@gmail.com>
Collaborator

@kaiitunnz kaiitunnz left a comment

LGTM.

@timzsu timzsu merged commit 08d9640 into main May 15, 2026
11 checks passed
@timzsu timzsu deleted the zsu/rfc48-omni-mixins branch May 15, 2026 09:29